RANDOM BIOCHEMICAL NETWORKS: THE PROBABILITY OF
SELF-SUSTAINING AUTOCATALYSIS 

ELCHANAN MOSSEL, MIKE STEEL 



Abstract. We determine conditions under which a random biochemical system
is likely to contain a subsystem that is both autocatalytic and able to
survive on some ambient 'food' source. Such systems have previously been
investigated for their relevance to origin-of-life models. In this paper we
extend earlier work by finding precisely the order of catalysation required for
the emergence of such self-sustaining autocatalytic networks. This answers
questions raised in earlier papers, yet also allows for a more general class of
models. We also show that a recently described polynomial-time algorithm
for determining whether a catalytic reaction system contains an autocatalytic,
self-sustaining subsystem is unlikely to adapt to allow inhibitory catalysation
- in this case we show that the associated decision problem is NP-complete.

1. Introduction

The idea that the study of discrete random networks could provide some in- 
sight into the problem of how primitive life might have emerged from an ambient 
'soup' of molecules goes back to the mid-1980s. This was largely motivated by the
earlier investigation of random graphs, pioneered by Alfréd Rényi and Paul Erdős
in the 1950s and 1960s, which had revealed the widespread occurrence of
'threshold phenomena' (sometimes also called 'phase transitions') in properties of
these graphs (Erdős and Rényi, 1960). In the simplest random graph model one has a
set of vertices (points) and edges are added independently and randomly between
pairs of vertices. As the probability that any two vertices are joined by an edge
passes certain well-studied thresholds, there is typically a fundamental change in
various qualitative properties of a large random graph, such as its connectivity,
or the size of the largest component (see e.g. Bollobás, 2001). Extending this
approach, Bollobás and Rasmussen (1989) investigated when a directed cycle would
first emerge in a random directed graph, and how many vertices such a cycle would
contain.
They were motivated by the idea that the emergence of a primitive metabolic cycle 
was an essential step in the early history of life, writing "we want to know when 
the first catalytic feedbacks appear, and how many different RNA molecules they 
involve." Cohen (1988) also foresaw the relevance of random graph techniques for 
modelling primitive biological processes. 

The importance of cycles in early life had also been studied - from a slightly dif- 
ferent perspective - by Eigen (1971) and Eigen and Schuster (1979). They proposed 
a metabolic 'hypercycle' as a way of circumventing the so-called 'error catastrophe' 
in the formation of longer strings of nucleotides, first demonstrated by Maynard
Smith (1983). Such processes, and how they might further evolve into early life,
have been extensively investigated using both stochastic and dynamical
approaches (e.g. Scheuring, 2000; Wills and Henderson, 1997; Zintzaras, Santos and
Szathmáry, 2002).

The idea that threshold phenomena might help explain some of the mystery 
surrounding the emergence of life-like systems from a soup of inanimate molecules 
was developed further by Dyson (1982, 1985) and Stuart Kauffman (1986, 1993). 
Kauffman considered simple autocatalytic protein networks where amino acid se- 
quences catalyse the joining (or 'ligation') of shorter sequences, and the cutting (or 
'cleavage') of longer sequences. He calculated that under a simple model of random 
catalysation, once the collection of sequences became sufficiently extensive there 
would inevitably emerge a subsystem of reactions that was both autocatalytic and 
able to be sustained from an ambient supply of short sequences (such as single or 
pairs of amino acids). Kauffman realised that simple random graphs and digraphs 
by themselves do not capture the intricacy of chemical reactions and catalysis. A
more complex discrete structure - which has become known as a catalytic reaction
system - is required in order to formalize and study the concept of a system of
molecules that catalyses all the reactions required for their generation, and which
can be sustained from some ambient 'food' source of molecules, F. A different
discrete model for self-reproducing systems based on Petri nets has also been
developed by Sharov (1991) for investigating the dynamical properties of these
systems, but we do not deal with this model here.

Several investigators have developed the study of catalytic reaction systems
and random autocatalysis (Hordijk and Fontanari, 2003; Hordijk and Steel, 2004;
Lohn et al., 1998; Wills and Henderson, 1997), though the approach also has its
critics (e.g. Lifson, 1997; Orgel, 1992; Maynard Smith and Szathmáry 1995). These
criticisms are mainly of two types. Firstly, Kauffman invoked overly simplistic and
strong assumptions in his analysis - for example he considered just binary sequences
(i.e. two amino acids) and assumed that each molecule had the same fixed probability
of catalysing any given reaction. In this paper we make much weaker, and thereby
hopefully more robust, assumptions in our probabilistic analysis. A second concern
is more general - the concept of a 'protein-first' start to life is problematic, since
proteins, unlike RNA, are not able to replicate (for a discussion of this point, part
of the so-called 'chicken and egg' problem, see Lifson, 1997; Maynard Smith and
Szathmáry 1995; or Penny 2004). Thus it is quite likely that other sequences
besides proteins (such as RNA) may have been part of the first prebiotic systems, 
and there has been considerable interest from biochemists in the feasibility of an 
'RNA world' in the early stages of the formation of life (for a recent survey, see 
Penny 2004). 

At this point there are at least two ways to formalize the concept of a
self-sustaining and autocatalytic set of molecules - the two we study here are
referred to as the RAF (reflexively autocatalytic and F-generated) and CAF
(constructively autocatalytic and F-generated) sets. The former was investigated
in Steel (2000) and Hordijk and Steel (2004). A CAF, which we formalize in this
paper, is a slightly
stronger notion - it requires that any molecule m that is involved in any catalysation 
must already have been built up from catalysed reactions (starting from F). This 
concept is perhaps overly restrictive, since it might be expected that m would still 
be present in a random biochemical system in low concentrations initially before 
reactions that generate a steady supply of m become established. 

For the sequence-based models of the type studied by Kauffman, we determine 
the degree of catalysation required for a RAF or a CAF to arise. In Kauffman's 
model reactions consist of the concatenation and cutting of sequences up to some 
maximal (large) length, starting from small sequences of length at most t, and each 
molecule has a certain probability of (independently) catalysing any given reaction. 
Let $\mu(x)$ denote the average number of (concatenation) reactions that sequence $x$
catalyses, which may depend on $|x|$, the length of $x$. Then, roughly speaking, our
results show that if $\mu(x)/|x|$ is small the probability that the system contains a
RAF is small; conversely if $\mu(x)/|x|$ is large the probability the system contains a
RAF is close to 1, and indeed in this case there is likely to be a RAF for which all
the molecules in the system are involved. This confirms two conjectures that were
posed in Steel (2000) and confirms some trends that were suggested by simulations
in Hordijk and Steel (2004).

Our results for RAFs contrast sharply with the degree of catalysation required
for a CAF. In that case each molecule needs to catalyse, on average, some fixed
proportion of all reactions for a CAF to be likely. That is, the corresponding value
of $\mu(x)$ required for a likely occurrence of a CAF is exponentially larger (with $n$)
than for a RAF.

We begin this paper by formalizing the concepts of RAF and CAF, and we do so 
in a more general setting than Hordijk and Steel (2004) as we consider the effect of 
general catalysation regimes - for example by allowing certain molecules to inhibit 
certain reactions. In this case determining whether an arbitrary catalytic reaction 
system contains a RAF seems to be computationally intractable. Indeed we show 
that the decision problem is NP-complete. This contrasts with the situation where 
one allows only positive catalysation; in that case a polynomial-time algorithm (in 
the size of the system) for finding a RAF if one exists was described in Hordijk 
and Steel (2004). Sections 4 and 5 present the main results concerning the
growth of $\mu(x)$ with $|x|$ required for RAF and CAF generation, and in Section 6 we
make some concluding comments, and raise some questions for further investigation.

Although the assumptions in Kauffman's original paper were quite strong - for 
example that each molecule had the same probability of catalysing any given re- 
action - in this paper we have been able to weaken some of these assumptions. 
The analysis in this paper still ignores inhibitory catalysis, side reactions that may
deplete certain reactants (Szathmáry 2000), and dynamical aspects of the process
(Szathmáry and Maynard Smith 1995); however, we hope to extend this analysis in
future work.



2. Preliminaries and definitions 

We mostly follow the notation of Steel (2000) and Hordijk and Steel (2004). Let 
$X$ denote a set of molecules and $\mathcal{R}$ a set of reactions, where we regard a
reaction as an ordered pair $(A, B)$ where $A, B$ are subsets of $X$ called the
reactants and products respectively. Let $F$ be a distinguished subset of $X$, which
can be regarded as some plentiful supply ('food') of reactants.

For $r \in \mathcal{R}$ let $\rho(r) = A$ and $\pi(r) = B$, and for a set
$\mathcal{R}' \subseteq \mathcal{R}$ let

$\rho(\mathcal{R}') := \bigcup_{r \in \mathcal{R}'} \rho(r),$
$\pi(\mathcal{R}') := \bigcup_{r \in \mathcal{R}'} \pi(r),$
and

$\mathrm{supp}(\mathcal{R}') := \rho(\mathcal{R}') \cup \pi(\mathcal{R}').$

Thus $\mathrm{supp}(\mathcal{R}')$ denotes the molecules in $X$ that are used or produced
by at least one reaction in $\mathcal{R}'$.

Given a subset $\mathcal{R}'$ of $\mathcal{R}$ and a subset $X'$ of $X$ the closure of $X'$
relative to $\mathcal{R}'$, denoted $\mathrm{cl}_{\mathcal{R}'}(X')$, is the (unique) minimal
subset $W$ of $X$ that contains $X'$ and that satisfies the following condition for each
reaction $(A, B) \in \mathcal{R}'$:

$A \subseteq X' \cup W \Longrightarrow B \subseteq W.$

It is easily seen that $\mathrm{cl}_{\mathcal{R}'}(X')$ is precisely the set of molecules
that can be generated starting from $X'$ and repeatedly applying reactions selected
(only) from $\mathcal{R}'$.
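The closure can be computed by simply firing reactions until nothing new appears. The following is a minimal sketch of that idea, not code from the paper; the representation of a reaction as a pair of frozensets and the toy molecule names are assumptions made purely for illustration.

```python
from typing import FrozenSet, Iterable, Set, Tuple

# A reaction is modelled as (reactants A, products B); this encoding is an assumption.
Reaction = Tuple[FrozenSet[str], FrozenSet[str]]


def closure(start: Iterable[str], reactions: Iterable[Reaction]) -> Set[str]:
    """Return the molecules reachable from `start` using only `reactions`."""
    known = set(start)
    pending = list(reactions)
    changed = True
    while changed:
        changed = False
        remaining = []
        for reactants, products in pending:
            if reactants <= known:
                # The reaction can fire; record any products not yet seen.
                if not products <= known:
                    known |= products
                    changed = True
            else:
                remaining.append((reactants, products))
        pending = remaining
    return known


# Hypothetical toy system: food {a, b}; "c" is never available.
toy = [
    (frozenset({"a", "b"}), frozenset({"ab"})),
    (frozenset({"ab"}), frozenset({"abab"})),
    (frozenset({"c", "ab"}), frozenset({"cab"})),
]
result = closure({"a", "b"}, toy)
print(sorted(result))
```

Each pass either fires at least one new reaction or stops, so the loop runs at most once per reaction plus one final pass.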
Let $\gamma : 2^X \times \mathcal{R} \to \{0, 1\}$ be a catalysation function. The function
$\gamma$ tells us whether or not each reaction $r$ can proceed in its environment (e.g. be
'catalysed') depending on what other molecules are present. Thus let $\gamma(A, r) = 1$
precisely when $r$ would be catalysed if the other molecules in the system comprise the
set $A$. For example, consider a simple scenario where each reaction $r \in \mathcal{R}$ is
catalysed provided that at least one molecule in some set (specific to $r$) is present.
We can represent the associated function $\gamma$ as follows - we have a set
$C \subseteq X \times \mathcal{R}$ (as in Steel, 2000; Hordijk and Steel, 2004) where
$(x, r)$ indicates that molecule $x$ catalyses reaction $r$. The catalysation function
$\gamma = \gamma_C$ for this simple setting is then defined by

$\gamma_C(A, r) = \begin{cases} 1, & \text{if } \exists x \in A : (x, r) \in C; \\ 0, & \text{otherwise.} \end{cases}$

More generally, suppose we have two arbitrary sets $C(+) \subseteq X \times \mathcal{R}$ and
$C(-) \subseteq X \times \mathcal{R}$, which can represent, respectively, the molecules that
catalyse and inhibit the various reactions. Then a candidate for $\gamma$ is the function
$\gamma = \gamma_{C(+),C(-)}$ defined by:

(1) $\gamma_{C(+),C(-)}(A, r) = \begin{cases} 1, & \text{if } \exists x \in A : (x, r) \in C(+) \text{ and there is no } x' \in A : (x', r) \in C(-); \\ 0, & \text{otherwise.} \end{cases}$

Thus $\gamma_{C(+),C(-)}$ allows both catalysation and inhibition. We find it useful to
write $A \xrightarrow{C(+),\,C(-)} B$ to denote the reaction $(A, B)$. Similarly, we will
write $A \xrightarrow{C} B$ to denote the reaction $(A, B)$ together with a catalysation
function that satisfies (1).

When the sets $A$, $B$, $C$ are singletons we will often omit the {} symbols.

In case $\gamma$ is monotone in the first co-ordinate (i.e.
$A \subseteq B \Rightarrow \gamma(A, r) \leq \gamma(B, r)$) we will call $\gamma$ monotone.
Note that $\gamma_C$ is monotone, and that monotone catalytic functions do not allow
inhibition effects.

The triple $\mathcal{Q} = (X, \mathcal{R}, \gamma)$ is called a catalytic reaction system.
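To make the two catalysation functions concrete, here is a small sketch in which reactions are referred to by string labels. The helper name `make_gamma` and the labels are hypothetical; only the logic of the simple function and of equation (1) is taken from the text.

```python
from typing import AbstractSet, Callable, Tuple

Pair = Tuple[str, str]  # (molecule x, reaction label r); labels are illustrative only


def make_gamma(c_plus: AbstractSet[Pair],
               c_minus: AbstractSet[Pair] = frozenset()
               ) -> Callable[[AbstractSet[str], str], int]:
    """Build the catalysation function of equation (1).

    With an empty inhibitor set this reduces to the simple function gamma_C.
    """
    def gamma(present: AbstractSet[str], r: str) -> int:
        catalysed = any((x, r) in c_plus for x in present)
        inhibited = any((x, r) in c_minus for x in present)
        return 1 if catalysed and not inhibited else 0
    return gamma


simple = make_gamma({("m", "r1")})                      # m catalyses r1
inhibitable = make_gamma({("m", "r1")}, {("z", "r1")})  # ... and z inhibits r1

print(simple({"m"}, "r1"), simple({"m", "z"}, "r1"))            # monotone in the set
print(inhibitable({"m"}, "r1"), inhibitable({"m", "z"}, "r1"))  # adding z switches r1 off
```

The second pair of calls shows why inhibition breaks monotonicity: enlarging the set of molecules present can turn a catalysed reaction into an uncatalysed one.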



2.1. Autocatalytic networks. Suppose we are given a catalytic reaction system
$\mathcal{Q} = (X, \mathcal{R}, \gamma)$ and a subset $F$ of $X$.

A reflexive autocatalytic network over $F$ or RAF for $\mathcal{Q}$ is a non-empty subset
$\mathcal{R}'$ of $\mathcal{R}$ for which

(i) $\rho(\mathcal{R}') \subseteq \mathrm{cl}_{\mathcal{R}'}(F)$
(ii) For each $r \in \mathcal{R}'$, $\gamma(\mathrm{supp}(\mathcal{R}'), r) = 1$.

In addition, to avoid biological triviality, we will also require that any RAF
$\mathcal{R}'$ also satisfies the condition

(iii) $\pi(\mathcal{R}') \not\subseteq F$

Thus for $\mathcal{R}'$ to be a RAF, each molecule involved in $\mathcal{R}'$ must be able to
be constructed from $F$ by repeated applications of reactions that lie just in
$\mathcal{R}'$ (condition (i)) and each reaction in $\mathcal{R}'$ must be catalysed by the
system of molecules involved in $\mathcal{R}'$ (condition (ii)). This definition is a slight
generalization of that given by Hordijk and Steel (2004) to allow for more general
catalysation functions $\gamma$ in condition (ii). Condition (iii) simply ensures that any
set of reactions that produce only molecules that are already in the food set $F$ does
not constitute a RAF.
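The three RAF conditions can be checked directly for a candidate set of reactions. The sketch below assumes simple catalysation (a set of molecule/reaction pairs), labels reactions with hypothetical names, and uses a made-up two-reaction example; it is an illustration of the definition rather than the algorithm of Hordijk and Steel (2004).

```python
def closure(food, reactions):
    """Molecules reachable from `food` using only the given (reactants, products) pairs."""
    known = set(food)
    changed = True
    while changed:
        changed = False
        for reactants, products in reactions:
            if reactants <= known and not products <= known:
                known |= products
                changed = True
    return known


def is_raf(subset, food, catalysts):
    """Check conditions (i)-(iii) for `subset`, a dict label -> (reactants, products).

    `catalysts` is a set of (molecule, label) pairs, i.e. simple catalysation.
    """
    if not subset:
        return False
    rho = set().union(*(a for a, _ in subset.values()))
    pi = set().union(*(b for _, b in subset.values()))
    supp = rho | pi
    generated = rho <= closure(food, list(subset.values()))                  # (i)
    catalysed = all(any((x, r) in catalysts for x in supp) for r in subset)  # (ii)
    nontrivial = not pi <= set(food)                                         # (iii)
    return generated and catalysed and nontrivial


# Hypothetical example: r1 is catalysed by a molecule that only r2 produces.
food = {"a", "b"}
reactions = {
    "r1": (frozenset({"a", "b"}), frozenset({"ab"})),
    "r2": (frozenset({"ab", "b"}), frozenset({"abb"})),
}
catalysts = {("abb", "r1"), ("ab", "r2")}

print(is_raf(reactions, food, catalysts))                  # both reactions together
print(is_raf({"r1": reactions["r1"]}, food, catalysts))    # r1 alone lacks its catalyst
```

In this example the pair of reactions supports itself only as a whole, which is exactly the reflexive character that conditions (i) and (ii) are meant to capture.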

Next we describe a condition which is somewhat stronger than the RAF require- 
ment. 

A constructively autocatalytic network over $F$ or CAF for $\mathcal{Q}$ is a strictly
nested sequence $\emptyset \subset \mathcal{R}_1 \subset \mathcal{R}_2 \subset \cdots \subset \mathcal{R}_k$,
for which

(i) $\rho(\mathcal{R}_1) \subseteq F$ and for each $r \in \mathcal{R}_1$, $\gamma(F, r) = 1$.
(ii) For all $i \in \{1, \ldots, k-1\}$, $\rho(\mathcal{R}_{i+1}) \subseteq \mathrm{supp}(\mathcal{R}_i)$,
and for each $r \in \mathcal{R}_{i+1}$, $\gamma(\mathrm{supp}(\mathcal{R}_i), r) = 1$.
(iii) $\pi(\mathcal{R}_1) \not\subseteq F$.

Informally, a CAF is a way to sequentially build up a set of molecules, starting
with $F$, and in such a way that every reaction is catalyzed by at least one molecule
that has already been constructed. Notice that for any catalytic reaction system
$\mathcal{Q}$, any set $\mathcal{R}_i$ occurring in a CAF for $\mathcal{Q}$ is also a RAF for
$\mathcal{Q}$.

Figure 1 illustrates these two concepts in the case of simple catalysation (of the
form $\gamma = \gamma_C$).

Figure 1. (a) An example of a RAF and (b) a CAF, represented
as directed graphs. Molecules are shown as black nodes, reactions
are white nodes, $F = \{f_1, f_2, f_3, f_4\}$ and each (positive) catalysation
of a reaction by a molecule is indicated by a dashed arc. Solid
arcs show the input and output of each reaction.



There is a further condition we can impose on a RAF or CAF to make it more 
biologically relevant - namely we may require that a set of reactions is capable 
of constructing complex molecules required for maintaining certain biological pro- 
cesses (such as metabolism, error correction or reproduction). Of course there 
may be many combinations of complex molecules that suffice to maintain these
processes, but we would like $\mathcal{R}'$ to be able to construct at least one of these
combinations. We can formalize this notion as follows. Suppose $\mathcal{R}'$ is a RAF
(respectively $\mathcal{R}_1 \subset \mathcal{R}_2 \subset \cdots \subset \mathcal{R}_k = \mathcal{R}'$
a CAF) and suppose $\Omega \subseteq 2^{X-F}$ is a distinguished collection of subsets of
molecules. We say that $\mathcal{R}'$ is an $\Omega$-complex RAF (respectively an
$\Omega$-complex CAF) if the following condition ($\Omega$C) holds:

($\Omega$C) If $\Omega \neq \emptyset$ then $C \subseteq \pi(\mathcal{R}')$ for at least one $C \in \Omega$.

We can think of each set $C \in \Omega$ as a suite of complex molecules that are required
for maintaining certain biological processes, and condition ($\Omega$C) requires that the
RAF or CAF be capable of constructing at least one such set. Note that the definition
of an $\Omega$-complex RAF (respectively $\Omega$-complex CAF) reduces to that of a (simple)
RAF or CAF if we take $\Omega = \emptyset$.



3. The complexity of determining whether or not $\mathcal{Q}$ has a RAF or a CAF

In Hordijk and Steel (2004) it was shown that, when $\gamma = \gamma_C$, there is a
polynomial-time algorithm to determine if $\mathcal{Q}$ has a RAF. However if one allows
inhibition also - by replacing $\gamma_C$ by $\gamma = \gamma_{C(+),C(-)}$ - it is unlikely that
any efficient algorithm exists for determining a RAF, by virtue of the following result
whose proof is given in the Appendix.

Proposition 3.1. For arbitrary catalytic reaction systems
$\mathcal{Q} = (X, \mathcal{R}, \gamma_{C(+),C(-)})$ and a subset $F$ of $X$ the decision problem
'does $\mathcal{Q}$ have a RAF?' is NP-complete.

However for any monotone catalysation function $\gamma$ there is a simple algorithm
to determine whether or not $\mathcal{Q}$ has a CAF, which is essentially to let the system
'evolve from $F$'. We describe this now.

Proposition 3.2. Given a catalytic reaction system $\mathcal{Q} = (X, \mathcal{R}, \gamma)$,
with $\gamma$ monotone, there is a polynomial time algorithm (in $|X|, |\mathcal{R}|$) to
determine whether or not $\mathcal{Q}$ has a CAF.

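Proof. Define a sequence $X_i, \mathcal{R}_i$ for $i \geq 1$ as follows:

$X_1 = F; \quad \mathcal{R}_1 = \{r \in \mathcal{R} : \rho(r) \subseteq F, \text{ and } \gamma(F, r) = 1\},$
and for $i \geq 1$ set

$X_{i+1} = X_i \cup \pi(\mathcal{R}_i); \quad \mathcal{R}_{i+1} = \mathcal{R}_i \cup \{r \in \mathcal{R} : \rho(r) \subseteq X_i \text{ and } \gamma(X_i, r) = 1\}.$

Then provided $\mathcal{R}_1 \neq \emptyset$ the sequence
$\mathcal{R}_1 \subseteq \mathcal{R}_2 \subseteq \cdots \subseteq \mathcal{R}_k$ (for any $k \geq 1$)
is a CAF for $\mathcal{Q}$. If $\mathcal{R}_1$ is empty, then clearly $\mathcal{Q}$ has no CAF. □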
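The 'evolve from $F$' idea in this proof can be sketched as a short loop. This is an illustrative rendering under stated assumptions, not the authors' implementation: reactions carry hypothetical string labels, the catalysation function is passed in as a Python callable assumed to be monotone, each layer is computed from the already-updated molecule set (which merely skips repeated layers), and the non-triviality condition is not checked separately.

```python
def find_caf(food, reactions, gamma):
    """Grow nested reaction sets from the food set; return them, or None if stuck at once.

    reactions: dict label -> (reactants, products), each a frozenset of molecules.
    gamma: callable (present_molecules, label) -> 0 or 1, assumed monotone.
    """
    molecules = set(food)

    def enabled(present):
        return {r for r, (reactants, _) in reactions.items()
                if reactants <= present and gamma(present, r) == 1}

    layer = enabled(molecules)          # reactions that can run on food alone
    if not layer:
        return None
    layers = [frozenset(layer)]
    while True:
        for r in layer:                 # add everything the current layer produces
            molecules |= reactions[r][1]
        grown = layer | enabled(molecules)
        if grown == layer:              # fixed point: nothing new can be added
            return layers
        layer = grown
        layers.append(frozenset(layer))


def gamma_from(catalysts):
    """Simple catalysation from a set of (molecule, label) pairs."""
    return lambda present, r: int(any((x, r) in catalysts for x in present))


food = {"a", "b"}
reactions = {
    "r1": (frozenset({"a", "b"}), frozenset({"ab"})),
    "r2": (frozenset({"ab", "b"}), frozenset({"abb"})),
}
built = find_caf(food, reactions, gamma_from({("a", "r1"), ("ab", "r2")}))
stuck = find_caf(food, reactions, gamma_from({("abb", "r1"), ("ab", "r2")}))
print(built, stuck)
```

Every pass of the loop adds at least one reaction or terminates, so the number of passes is bounded by the number of reactions, in line with the polynomial bound of Proposition 3.2. The second call returns nothing because its first reaction needs a catalyst that is only produced later.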

4. Random sequence-based models 

In this section we take $X = X(n)$, the set of sequences of length at most $n$ over
the alphabet set $\{0, 1, \ldots, \kappa - 1\}$. Let $F$ be a distinguished (small) subset of
$X(n)$; in this paper we will take $F = X(t)$ for a fixed value of $t$ (often a value such
as $t = 2$ has been taken in earlier papers). For a sequence $x \in X(n)$ we will let $|x|$
denote its length. Let $\mathcal{R}(n)$ denote the set of all ordered pairs $r = (A, B)$
where, for some $a, b, c \in X$ for which $c = ab$ (= the concatenation of $a$ and $b$) either
$A = \{a, b\}$ and $B = \{ab\}$ - in which case we call $r$ a forward reaction - or
$A = \{ab\}$ and $B = \{a, b\}$ - in which case we call $r$ a backward reaction. We may think
of the pair $r = (\{a, b\}, \{c\})$ as representing the ligation reaction

$a + b \to c;$

and the pair $r = (\{c\}, \{a, b\})$ as representing the cleavage reaction

$c \to a + b.$

We will let $\mathcal{R}_+(n)$ and $\mathcal{R}_-(n)$ denote the (partitioning) subsets of
$\mathcal{R}(n)$ consisting of the forward and backward reactions, respectively.

Note that we have

(2) $x_n := |X(n)| = \kappa + \kappa^2 + \cdots + \kappa^n = \frac{\kappa^{n+1} - \kappa}{\kappa - 1},$

which is the total number of sequences of length at most $n$, and

(3) $r_n := |\mathcal{R}_+(n)| = \kappa^2 + 2\kappa^3 + \cdots + (n-1)\kappa^n = \frac{(n-1)\kappa^{n+2} - n\kappa^{n+1} + \kappa^2}{(\kappa - 1)^2},$

which is the total number of forward reactions that construct sequences of length
at most $n$. We will often below use the fact that, for all $n \geq 1$,

(4) $1 - O\left(\frac{1}{n}\right) \leq \frac{r_n}{n x_n} \leq 1$

(the notation $f(n) = g(n) + O(\frac{1}{n})$ means $|f(n) - g(n)| \leq K/n$ for some constant $K$
for all $n \geq 1$).
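The counts in (2) and (3) are easy to confirm by brute-force enumeration for small parameters. The sketch below is only a sanity check under one explicit assumption: each way of cutting a word $c$ into two non-empty parts is counted as its own forward reaction, which is the convention under which the sum in (3) arises. The function names are illustrative.

```python
from itertools import product


def sequences(kappa, n):
    """All words of length 1..n over {0, ..., kappa-1} (kappa <= 10 assumed)."""
    alphabet = [str(d) for d in range(kappa)]
    return ["".join(w) for s in range(1, n + 1) for w in product(alphabet, repeat=s)]


def forward_reactions(kappa, n):
    """One ligation a + b -> c per way of cutting each word c into two non-empty parts."""
    return [(c[:i], c[i:], c) for c in sequences(kappa, n) for i in range(1, len(c))]


def x_closed(kappa, n):
    """Closed form for x_n, the number of sequences of length at most n."""
    return (kappa ** (n + 1) - kappa) // (kappa - 1)


def r_closed(kappa, n):
    """Closed form for r_n, the number of forward reactions."""
    return ((n - 1) * kappa ** (n + 2) - n * kappa ** (n + 1) + kappa ** 2) // (kappa - 1) ** 2


for kappa in (2, 3):
    for n in range(1, 6):
        x, r = len(sequences(kappa, n)), len(forward_reactions(kappa, n))
        assert (x, r) == (x_closed(kappa, n), r_closed(kappa, n))
        assert r <= n * x  # the upper bound in (4)
        print(kappa, n, x, r, round(r / (n * x), 3))
```

The printed ratio $r_n / (n x_n)$ stays below 1 and creeps upwards with $n$, as (4) indicates.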

We study a catalysation function $\gamma$ obtained by setting $\gamma = \gamma_C$ where $C$ is
some random assignment of catalysation (i.e. pairs $(x, r)$) that is subject to the
following requirements:

(R1) The events $((x, r) \in C : x \in X(n), r \in \mathcal{R}_+(n))$ are independent.
(R2) For each sequence $x \in X(n)$ and reaction $r \in \mathcal{R}_+(n)$, the probability
$\mathbb{P}[(x, r) \in C]$ depends only on $x$.

This model is more general than that described in Kauffman (1993), Steel (2000)
or Hordijk and Steel (2004) for several reasons - it allows different catalysation
probabilities for forward and backward reactions, it allows dependencies involving
the catalysation of backward reactions, and the catalysation ability of a molecule
can vary according to the molecule considered (for example, it can depend on the
length of the molecule).

Let $\mu_n(x)$ be the expected number of reactions in $\mathcal{R}_+(n)$ that molecule $x$
catalyses. By (R2) we can write this as

$\mu_n(x) = \mathbb{P}[(x, r) \in C] \cdot |\mathcal{R}_+(n)|,$

for any given $r \in \mathcal{R}_+(n)$.
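A random assignment $C$ of this kind can be drawn by giving every pair $(x, r)$ an independent coin flip whose bias is $\mu_n(x)/r_n$. The sketch below covers forward reactions only; the function name, the tiny example system and the particular choices of $\mu_n$ are assumptions for illustration.

```python
import random


def sample_catalysation(molecules, reactions, mu, r_n, rng):
    """Draw a random set C of (molecule, reaction) pairs over the forward reactions.

    Each pair enters C independently (R1) with probability mu(x) / r_n,
    which depends on the molecule x only (R2).
    """
    assignment = set()
    for x in molecules:
        p = min(1.0, mu(x) / r_n)
        for r in reactions:
            if rng.random() < p:
                assignment.add((x, r))
    return assignment


molecules = ["0", "1", "00", "01", "10", "11"]                  # X(2), binary alphabet
reactions = [("0", "0"), ("0", "1"), ("1", "0"), ("1", "1")]    # the 4 forward reactions
r_n = len(reactions)
rng = random.Random(0)

nothing = sample_catalysation(molecules, reactions, lambda x: 0.0, r_n, rng)
everything = sample_catalysation(molecules, reactions, lambda x: float(r_n), r_n, rng)
length_biased = sample_catalysation(molecules, reactions, lambda x: 0.5 * len(x), r_n, rng)
print(len(nothing), len(everything), len(length_biased))
```

The third call lets longer molecules catalyse more reactions on average, which is the kind of length dependence that (R2) permits.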




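For $\mathcal{Q}(n) = (X(n), \mathcal{R}(n), \gamma_C)$, $F = X(t)$ for some fixed value of $t$
and $\Omega \subseteq 2^{X(n)-F}$, let $\mathcal{P}_n(\Omega)$ be the probability that
$\mathcal{Q}(n)$ has an $\Omega$-complex RAF. We can now state the first main result of this
paper.

Theorem 4.1. Consider a random catalytic reaction system $\mathcal{Q}(n)$ satisfying (R1)
and (R2) and with $F = X(t)$ for a fixed value of $t$, with $t < n$. Let $\lambda > 0$ and let
$\Omega \subseteq 2^{X(n)-F}$.

(i) Suppose that $\mu_n(x) \leq \lambda n$ for all $x \in X(n)$. Then

$\mathcal{P}_n(\Omega) \leq 1 - \exp\left(-2\lambda x_t^2 \left(1 + O\left(\tfrac{1}{n}\right)\right)\right) \quad (\to 0 \text{ as } \lambda \to 0),$

where $x_t$ is defined in (2).

(ii) Suppose that $\mu_n(x) \geq \lambda n$ for all $x \in X(n)$, or that
$\mu_n(x) \geq \lambda \theta_n |x|$ for all $x \in X(n)$, where $\lambda > \log_e(\kappa)$ and where
$\theta_n = \left(1 + \frac{x_n}{r_n}\right)^{-1} \sim 1$. Then,

$\mathcal{P}_n(\Omega) \geq 1 - \frac{\kappa(\kappa e^{-\lambda})^t}{1 - \kappa e^{-\lambda}} \quad (\to 1 \text{ as } \lambda \to \infty).$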



To illustrate Theorem 4.1, consider binary sequences, and a food set consisting of
the 6 molecules of length at most 2 (thus $\kappa = t = 2$, which was the default setting
for the simulations in Hordijk and Steel 2004). Then taking $\lambda = 4$ in Theorem
4.1(ii) we have $\mathcal{P}_n > 0.99$.
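The figure quoted in this illustration can be reproduced by evaluating the lower bound $1 - \kappa(\kappa e^{-\lambda})^t / (1 - \kappa e^{-\lambda})$ directly. The short sketch below does only that arithmetic; the function name is an illustrative choice.

```python
import math


def raf_lower_bound(kappa, t, lam):
    """Evaluate 1 - kappa * (kappa * e^-lam)^t / (1 - kappa * e^-lam)."""
    ratio = kappa * math.exp(-lam)
    if ratio >= 1:
        raise ValueError("the bound needs lam > log(kappa)")
    return 1 - kappa * ratio ** t / (1 - ratio)


bound = raf_lower_bound(kappa=2, t=2, lam=4.0)
print(round(bound, 4))  # roughly 0.997, comfortably above 0.99
```

Smaller values of $\lambda$ weaken the bound quickly; for example, the same call with `lam=2.0` gives a value well below 0.99.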

As an immediate corollary of Theorem 4.1 we obtain the following result, which
confirms the two conjectures posed in Steel (2000).

Corollary 4.2. Consider random catalytic reaction systems $\mathcal{Q}(n)$ ($n > t$)
satisfying (R1) and (R2) and with $F = X(t)$ for a fixed value of $t$. Take
$\Omega = \emptyset$, and let $\mathcal{P}_n = \mathcal{P}_n(\Omega)$, the probability that
$\mathcal{Q}(n)$ has a RAF.

(i) If

$\max_{x \in X(n)} \frac{\mu_n(x)}{n} \to 0 \text{ as } n \to \infty$

then $\lim_{n \to \infty} \mathcal{P}_n = 0$.
(ii) If

$\min_{x \in X(n)} \frac{\mu_n(x)}{|x|} \to \infty \text{ as } n \to \infty$

then $\lim_{n \to \infty} \mathcal{P}_n = 1$.



Remarks

• Corollary 4.2 has been worded in such a way that it clearly remains true if
we interchange the terms $\frac{\mu_n(x)}{|x|}$ and $\frac{\mu_n(x)}{n}$ in either part (i) or
part (ii) or both.

• The condition described in Corollary 4.2(ii) suffices to guarantee (for large
$n$) a RAF involving all the molecules in $X(n)$. However it does not guarantee
that all of $\mathcal{R}_+(n)$ is a RAF. The condition for this latter event to
hold with high probability as $n \to \infty$ (assuming for simplicity that $\mu_n(x)$ is
constant, say $\mu_n$, over $X(n)$) is the stronger condition that

$\liminf_{n \to \infty} \frac{\mu_n}{n^2} > \log_e(\kappa).$

This follows from (a slight extension of) Theorem 1 of Steel (2000).

• Note that if we were to view a sequence $(x_1, x_2, \ldots, x_n) \in X(n)$ and its
reversal $(x_n, x_{n-1}, \ldots, x_1)$ as equivalent molecules then Corollary 4.2 still
holds since asymptotically (with $n$) palindromic sequences have a negligible
influence in the calculations.

• Similarly, if we were to modify (R2) to require that any molecule $x$ cannot
catalyse a reaction $r$ for which $x$ is a reactant, then Corollary 4.2 would still
hold (and Theorem 4.1 would only be slightly modified) since the number of
reactants in any reaction $r$ is asymptotically negligible (with $n$) compared
with the total number of molecules that could catalyse $r$.

• Note that the lower bound on $\mathcal{P}_n$ in Theorem 4.1(ii) is valid for any value
of $n > t$ (previous studies, from Kauffman's (1986) onwards, had drawn
conclusions by considering limits as $n$ tended to infinity, but the bound
in Theorem 4.1(ii) is independent of $n$). Thus, very large systems are not
necessarily required for self-sustaining random autocatalysis, a concern that
had been raised by Szathmáry (2000).

To establish Theorem 4.1 we require first two further results - Lemma 4.3 and
Proposition 4.4 - and to describe them we introduce a further definition.

We say that a reaction $r \in \mathcal{R}(n)$ is globally-catalyzed (or GC) if there exists
any molecule in $X(n)$ that catalyzes $r$. By the assumptions (R1) and (R2) above the
probability that any forward reaction $r$ is GC does not depend on $r$. Let $p_*$ denote
this probability and let $q_* = 1 - p_*$.

We will show that when $p_*$ is sufficiently large then there exists a RAF
$\mathcal{R} \subseteq \mathcal{R}_+(n)$ such that $X(n) - F \subseteq \pi(\mathcal{R})$ - in other
words all molecules that are not already supplied by $F$ can be generated by catalyzed
reactions.

On the other hand, we will show that when $p_*$ is small enough then the
probability that there exists any globally catalyzed reaction that generates any
molecule in $X(t+1)$ from molecules in $X(t)$ is small - thus proving that the
probability that a RAF exists is small.

The first step is to estimate the probability of global catalysation. 

Lemma 4.3. Consider the system $\mathcal{Q}(n)$ satisfying properties (R1) and (R2) and
with $F = X(t)$ for a fixed $t$, and let $\lambda > 0$ be any positive constant.

(i) The probability $q_*$ that a reaction $r \in \mathcal{R}_+(n)$ is not globally catalyzed
is given by

$q_* = \prod_{x \in X(n)} \left(1 - \frac{\mu_n(x)}{r_n}\right).$

In particular,

(ii) if $\mu_n(x) \leq \lambda n$ for all $x$ then

$q_* \geq \exp\left(-\lambda\left(1 + O\left(\tfrac{1}{n}\right)\right)\right);$

(iii) if $\mu_n(x) \geq \lambda n$ for all $x$ then

$q_* \leq e^{-\lambda};$

(iv) if $\mu_n(x) \geq \lambda \theta_n |x|$ for all $x$ (where $\theta_n$ is as defined in
Theorem 4.1) then

$q_* \leq e^{-\lambda}.$

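Proof. Part (i) is immediate from (R1) and (R2). Part (ii) follows by combining
part (i) and (4) to give:

$q_* \geq \left(1 - \frac{\lambda n}{r_n}\right)^{x_n} \geq \exp\left(-\lambda\left(1 + O\left(\tfrac{1}{n}\right)\right)\right).$

Part (iii) follows from part (i) together with (4) which gives

$q_* \leq \left(1 - \frac{\lambda n}{r_n}\right)^{x_n} \leq \left(1 - \frac{\lambda}{x_n}\right)^{x_n} \leq e^{-\lambda},$

as required. For part (iv), combine part (i), the identity
$|\{x \in X(n) : |x| = s\}| = \kappa^s$, and the inequality $(1 - a)^b \leq \exp(-ab)$ for
$a, b > 0$, to obtain

$q_* \leq \prod_{s=1}^{n} \left(1 - \frac{\lambda \theta_n s}{r_n}\right)^{\kappa^s} \leq \prod_{s=1}^{n} \exp\left(-\frac{\lambda \theta_n s \kappa^s}{r_n}\right) = \exp\left(-\frac{\lambda \theta_n}{r_n} \sum_{s=1}^{n} s\kappa^s\right).$

Now, $\sum_{s=1}^{n} s\kappa^s = r_{n+1}/\kappa$ from (3), and part (iv) now follows by
identifying $\theta_n$ with $\frac{\kappa r_n}{r_{n+1}}$ (again using (3)). Note that $\theta_n$
converges to 1 as $n \to \infty$. □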
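The bounds in parts (iii) and (iv) can be checked numerically for a particular system. The sketch below assumes, purely for convenience, that $\mu_n(x)$ depends on $x$ only through its length, so that the product in part (i) can be grouped by length; the parameter values and function names are illustrative.

```python
import math


def x_n(kappa, n):
    """Number of sequences of length at most n."""
    return sum(kappa ** s for s in range(1, n + 1))


def r_n(kappa, n):
    """Number of forward reactions building sequences of length at most n."""
    return sum((s - 1) * kappa ** s for s in range(1, n + 1))


def q_star(kappa, n, mu_by_length):
    """Probability that a forward reaction has no catalyst, grouping molecules by length."""
    r = r_n(kappa, n)
    return math.prod((1 - mu_by_length(s) / r) ** (kappa ** s) for s in range(1, n + 1))


kappa, n, lam = 2, 10, 3.0
theta = r_n(kappa, n) / (r_n(kappa, n) + x_n(kappa, n))   # equals (1 + x_n / r_n)^-1

q_iii = q_star(kappa, n, lambda s: lam * n)               # mu_n(x) = lam * n
q_iv = q_star(kappa, n, lambda s: lam * theta * s)        # mu_n(x) = lam * theta_n * |x|

print(q_iii, q_iv, math.exp(-lam))
print(theta, kappa * r_n(kappa, n) / r_n(kappa, n + 1))   # two expressions for theta_n
```

Both computed values of $q_*$ fall below $e^{-\lambda}$, and the two expressions for $\theta_n$ agree, reflecting the identity $r_{n+1} = \kappa(r_n + x_n)$.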

Proposition 4.4. Consider a random catalytic reaction system $\mathcal{Q}(n)$ satisfying
properties (R1) and (R2) and with $F = X(t)$ for a fixed $t$, where $t < n$. As before,
denote the probability that a forward reaction is not globally catalyzed by $q_*$. Then

(i) The probability that $\mathcal{Q}(n)$ has a RAF is at most $1 - q_*^{2x_t^2}$.

(ii) If $\kappa q_* < 1$ then the probability that $\mathcal{Q}(n)$ has a RAF $\mathcal{R}$ with
$X(n) - F \subseteq \pi(\mathcal{R})$ is at least

$1 - \frac{\kappa(\kappa q_*)^t}{1 - \kappa q_*}.$

Proof. Part (i). Note that there are at most $2x_t^2$ forward reactions whose reactants
(inputs) lie in $X(t)$. With probability at least $q_*^{2x_t^2}$ none of these reactions is
GC, in which case there is no RAF for the system. The first part of the proposition now
follows.

Part (ii). Note that, for any $s \geq t$ the probability that a molecule $x$ with
$|x| = s + 1$ is not generated by any forward GC reaction from $X(s)$ is given by
$q_*^s$. Therefore the expected number of molecules $x$ with $|x| = s + 1$ which are not
generated by a forward GC reaction is $\kappa^{s+1} q_*^s$. In particular the probability that
there is a molecule in $X(s+1)$ that is not generated by a forward reaction from
$X(s)$ is at most $\kappa^{s+1} q_*^s$. This in turn implies that the probability that all
molecules in $X(n)$ are generated by forward GC reactions is at least

$1 - \kappa \sum_{s=t}^{n-1} (\kappa q_*)^s \geq 1 - \kappa \sum_{s=t}^{\infty} (\kappa q_*)^s = 1 - \frac{\kappa(\kappa q_*)^t}{1 - \kappa q_*}.$

Finally, note that if all molecules in $X(n) - F$ are generated by a set $\mathcal{R}$ of
forward GC reactions, and since $t < n$ (so that condition (iii) in the definition of a RAF
is satisfied) we have that $\mathcal{R}$ is a RAF for $\mathcal{Q}(n)$. □
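The step from the finite union bound to the geometric series can be seen numerically. The following sketch compares the finite sum $\sum_{s=t}^{n-1} \kappa^{s+1} q_*^s$ with its closed-form upper bound for an arbitrary illustrative value of $q_*$; the function names and the value 0.1 are assumptions.

```python
def finite_union_bound(kappa, t, n, q):
    """Sum over s = t..n-1 of kappa^(s+1) * q^s (expected number of ungenerated molecules)."""
    return sum(kappa ** (s + 1) * q ** s for s in range(t, n))


def geometric_bound(kappa, t, q):
    """Closed form kappa * (kappa*q)^t / (1 - kappa*q) of the infinite series."""
    if kappa * q >= 1:
        raise ValueError("the series converges only when kappa * q < 1")
    return kappa * (kappa * q) ** t / (1 - kappa * q)


kappa, t, q = 2, 2, 0.1
limit = geometric_bound(kappa, t, q)
for n in (3, 5, 10, 60):
    print(n, finite_union_bound(kappa, t, n, q), limit)
```

The finite sums increase towards the closed form without exceeding it, which is why the resulting lower bound on the probability of a RAF does not depend on $n$.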

Proof of Theorem 4.1.

Part (i). By Proposition 4.4(i) the probability that $\mathcal{Q}(n)$ has a RAF is at most
$1 - q_*^{2x_t^2}$ which by Lemma 4.3(ii) is at most

$1 - \left[\exp\left(-\lambda\left(1 + O\left(\tfrac{1}{n}\right)\right)\right)\right]^{2x_t^2} = 1 - \exp\left(-2\lambda x_t^2\left(1 + O\left(\tfrac{1}{n}\right)\right)\right).$

Clearly if $\mathcal{Q}(n)$ has no RAF, then it also has no $\Omega$-complex RAF, for any
$\Omega \subseteq 2^{X(n)-F}$.

Part (ii). This follows by combining Proposition 4.4(ii) with Lemma 4.3 parts
(iii) and (iv), and noting that a RAF $\mathcal{R}$ of $\mathcal{Q}(n)$ for which
$X(n) - F \subseteq \pi(\mathcal{R})$ is also an $\Omega$-complex RAF for any
$\Omega \subseteq 2^{X(n)-F}$. □



5. An analogous result for CAFs 

The degree of catalysation required for a CAF to arise in the system Q{n) is 
much greater than for a RAF. This seems reasonable since the definition of a CAF 
involves a much stronger requirement than a RAF on a set of reactions. However
the extent of the difference is interesting, and is given by the following analogue of
Theorem 4.1.

Theorem 5.1. Consider the random catalytic reaction system $\mathcal{Q}(n)$ and suppose
that $F = X(t)$. Let $\lambda > 0$ and let $\Omega \subseteq 2^{X(n)-F}$.

(i) If

$\mu_n(x) \leq \frac{\lambda r_n}{x_t^3}$

for all $x \in X(n)$, then the probability that $\mathcal{Q}(n)$ has an $\Omega$-complex CAF is
at most

$1 - \left(1 - \frac{\lambda}{x_t^3}\right)^{2x_t^3} \leq 2\lambda.$

(ii) If

$\mu_n(x) \geq \frac{\lambda r_n}{x_t}$

for all $x \in X(n)$, then the probability that $\mathcal{Q}(n)$ has an $\Omega$-complex CAF is
at least

$1 - \frac{\kappa(\kappa e^{-\lambda})^t}{1 - \kappa e^{-\lambda}}.$

Before presenting the proof of this result, we note that while the degree of
catalysation required for the likely occurrence of a RAF was that $\mu_n(x)$ should grow at
least linearly with $n$ (Theorem 4.1), the requirements for a CAF are quite different:
by Theorem 5.1, $\mu_n(x)$ must grow at least linearly with $r_n$ - and thereby
exponentially with $n$.
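The gap between the two regimes is easy to tabulate. The sketch below compares the catalysation level $\lambda n$ appearing in Theorem 4.1(ii) with the level $\lambda r_n / x_t$ appearing in Theorem 5.1(ii), for binary sequences with $t = 2$ and an illustrative $\lambda = 4$; the function names are hypothetical.

```python
def x_n(kappa, n):
    """Number of sequences of length at most n."""
    return sum(kappa ** s for s in range(1, n + 1))


def r_n(kappa, n):
    """Number of forward reactions building sequences of length at most n."""
    return sum((s - 1) * kappa ** s for s in range(1, n + 1))


def raf_level(lam, n):
    """Catalysation level lam * n sufficient for a likely RAF."""
    return lam * n


def caf_level(lam, kappa, t, n):
    """Catalysation level lam * r_n / x_t sufficient for a likely CAF."""
    return lam * r_n(kappa, n) / x_n(kappa, t)


lam, kappa, t = 4.0, 2, 2
for n in (5, 10, 20):
    raf, caf = raf_level(lam, n), caf_level(lam, kappa, t, n)
    print(n, raf, round(caf, 1), round(caf / r_n(kappa, n), 3))
```

The last column is constant: the CAF level is a fixed fraction $\lambda / x_t$ of all forward reactions, whereas the RAF level is a vanishing fraction of them as $n$ grows.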

Proof of Theorem 5.1. Part (i). Let $\mathcal{R}' := \{r \in \mathcal{R}_+(n) : \rho(r) \subseteq F\}$,
the set of all forward reactions that have all their reactants in $F$. The probability
that any given reaction $r \in \mathcal{R}'$ is not catalyzed by at least one element of $F$
is given by

$\prod_{x \in F} \left(1 - \frac{\mu_n(x)}{r_n}\right).$

Thus the probability that none of the reactions in $\mathcal{R}'$ are catalyzed by at least
one element of $F$ is

$\left(\prod_{x \in F} \left(1 - \frac{\mu_n(x)}{r_n}\right)\right)^{|\mathcal{R}'|}.$

In particular if $\mu_n(x) \leq \frac{\lambda r_n}{x_t^3}$, then, since $|F| = x_t$ and
$|\mathcal{R}'| \leq 2x_t^2$, this probability (that none of the reactions in $\mathcal{R}'$ is
catalyzed by at least one element of $F$) is at least

(5) $\left(1 - \frac{\lambda}{x_t^3}\right)^{2x_t^3} \geq 1 - 2\lambda.$

However when none of the reactions in $\mathcal{R}'$ is catalyzed, then $\mathcal{Q}(n)$ does
not have a CAF. Thus the probability that $\mathcal{Q}(n)$ has a CAF is at most 1 minus the
expression in (5), as required.
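The inequality used in (5) is an instance of Bernoulli's inequality $(1 - a)^m \geq 1 - ma$, here with $a = \lambda / x_t^3$ and $m = 2x_t^3$. The sketch below evaluates the resulting upper bound on the probability of a CAF for the binary food set of the earlier illustration ($x_t = 6$); the function name and the sample values of $\lambda$ are assumptions.

```python
def caf_upper_bound(lam, x_t):
    """Evaluate 1 - (1 - lam / x_t^3)^(2 * x_t^3), the bound of Theorem 5.1(i)."""
    return 1 - (1 - lam / x_t ** 3) ** (2 * x_t ** 3)


x_t = 6  # binary sequences of length at most 2
for lam in (0.01, 0.1, 0.4):
    print(lam, round(caf_upper_bound(lam, x_t), 4), 2 * lam)
```

For small $\lambda$ the exact expression is close to $2\lambda$, and for larger $\lambda$ it is noticeably smaller, so the simpler bound $2\lambda$ loses little where it matters.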

Part (ii). For every molecule $x \in X(n)$, and each $s \in \{t, \ldots, n\}$ let $E_s(x)$
be the event that there is at least one reaction $r_x$ of the form $a + b \to x$, where
$a, b \in X(s)$, that is catalysed by at least one molecule in $X(s)$.

Now, if $\mu_n(x) \geq \frac{\lambda r_n}{x_t}$, then for any forward reaction $r$, the
probability that $r$ is not catalysed by at least one molecule in $X(s)$ (for $s \geq t$) is
at most

$\left(1 - \frac{\lambda}{x_t}\right)^{x_s} \leq \exp\left(-\lambda \frac{x_s}{x_t}\right) \leq e^{-\lambda}$

and since, for each $x$ there are $|x| - 1$ choices for $r_x$ we have

(6) $\mathbb{P}(E_s(x)^c) \leq \exp(-\lambda(|x| - 1)),$

where $E_s(x)^c$ is the complementary event to $E_s(x)$.

Consider the event

$E_s := \bigcap_{x \in X(s+1) \setminus X(s)} E_s(x).$

By (6) and the identity $|X(s+1) - X(s)| = \kappa^{s+1}$ we have
$\mathbb{P}(E_s^c) \leq \kappa^{s+1} e^{-\lambda s}$ and so

$\mathbb{P}\left(\bigcap_{s=t}^{n-1} E_s\right) \geq 1 - \kappa \sum_{s=t}^{n-1} (\kappa e^{-\lambda})^s \geq 1 - \kappa \sum_{i=t}^{\infty} (\kappa e^{-\lambda})^i = 1 - \frac{\kappa(\kappa e^{-\lambda})^t}{1 - \kappa e^{-\lambda}}.$

However the event $\bigcap_{s=t}^{n-1} E_s$ ensures that the nested collection of reactions
$\mathcal{R}_i := \{r_x : x \in X(t+i)\}$, $i = 1, \ldots, n-t$ forms a CAF for $\mathcal{Q}(n)$,
and moreover one for which the maximal set $\mathcal{R}_{n-t}$ generates all elements of
$X(n) - F$ - thus it is also an $\Omega$-complex CAF for any $\Omega \subseteq 2^{X(n)-F}$. This
completes the proof. □



6. Discussion 

The question of how life first arose on earth is a multifaceted problem that stands 
out as one of the major questions in science (see for example Dyson, 1985; Fenchel,
2002; Joyce, 1989; Szathmáry 1999; Szathmáry and Smith, 1997). One dilemma,
frequently dubbed the 'chicken and egg' problem, is the question of which (if either)
came first: heredity (molecules such as DNA or RNA that carry information but
do not easily catalyse reactions), or metabolism (proteins that carry out reactions
but do not replicate). An alternative possibility is that an autocatalytic system of
molecules including RNA and proteins and possibly other molecules emerged as the 
first primitive prebiotic system. The theoretical investigation of catalytic reaction 
systems is an attempt to address just one aspect of this theory. This concerns the 
issue of whether, as Kauffman has maintained, we should expect self-sustaining, 
autocatalytic networks to emerge in random chemical systems once some threshold 
(in 'complexity', 'connectivity' or 'catalysation rate') is exceeded, or whether there 
is the requirement of some fine-tuning of the underlying biochemistry for such 
networks to occur. Orgel (1992) raises this as a concern about autocatalytic network
models, commenting that "it is always difficult in such theoretical models to see how
to close the cycle without making unreasonable assumptions about the specificity 
of catalysis." 

Our results here have helped delineate precisely how much catalysation is required
in order for random sequence-based chemical reaction systems (without any
'fine-tuning') to be likely to give rise to a RAF. In contrast to a CAF, where a high degree
of catalysation is required when the maximal sequence length $n$ is large, the likely
occurrence of a RAF depends just on whether the catalysation function $\mu_n(x)$ grows
sublinearly or superlinearly with $n$ (Corollary 4.2). The techniques developed in
this paper may provide some analytical predictive tools for biochemists designing in
vitro prebiotic experiments with large numbers of variants of RNA sequences and
other molecules.
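
The dependence on the catalysation rate can also be explored numerically. The following is a minimal sketch rather than the authors' own code. It assumes a binary alphabet, a food set consisting of all sequences of length at most `t`, ligation reactions $a + b \to ab$ only, and a catalysation rate that is the same for every molecule (the parameter `rate`, standing in for $\mu_n(x)$). The function names and the pruning loop are illustrative; the loop follows the general idea of the polynomial-time RAF search mentioned in the abstract, discarding reactions whose reactants or catalysts cannot be built from the food set.

```python
import itertools
import random


def sequences(max_len, alphabet="01"):
    """All non-empty strings over `alphabet` of length at most `max_len`."""
    return ["".join(s) for length in range(1, max_len + 1)
            for s in itertools.product(alphabet, repeat=length)]


def closure(food, reactions):
    """Molecules obtainable from `food` using only the ligations in `reactions`."""
    made = set(food)
    changed = True
    while changed:
        changed = False
        for a, b in reactions:
            if a in made and b in made and a + b not in made:
                made.add(a + b)
                changed = True
    return made


def max_raf(food, reactions, catalysts):
    """Prune reactions until every survivor has its reactants and at least one
    catalyst in the closure of the food set; returns the surviving set."""
    current = set(reactions)
    while True:
        made = closure(food, current)
        kept = {r for r in current
                if r[0] in made and r[1] in made
                and any(c in made for c in catalysts[r])}
        if kept == current:
            return current
        current = kept


def random_system(n, t, rate, rng):
    """Assumed model: each molecule catalyses each ligation independently, with
    `rate` the expected number of reactions catalysed per molecule."""
    molecules = sequences(n)
    food = set(sequences(t))
    reactions = [(a, b) for a in molecules for b in molecules
                 if len(a) + len(b) <= n]
    p = min(1.0, rate / len(reactions))
    catalysts = {r: {x for x in molecules if rng.random() < p}
                 for r in reactions}
    return food, reactions, catalysts


if __name__ == "__main__":
    rng = random.Random(1)
    for rate in (0.5, 1.0, 2.0, 4.0):
        hits = sum(bool(max_raf(*random_system(5, 2, rate, rng)))
                   for _ in range(20))
        print(f"rate={rate}: RAF found in {hits}/20 random systems")
```

Repeating the experiment for several values of `n` and plotting the fraction of systems containing a RAF against `rate` gives a rough empirical picture of the transition; the small sizes reachable by this sketch should not be expected to match the asymptotic statement closely.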

The development of a self-sustaining autocatalytic system would clearly be only
one step towards life; in particular, a reproducing system that is capable of undergoing
Darwinian selection eventually needs to develop. Here the recent concept of
an 'Eigen-Darwin' cycle (Poole et al., 1999) may hold promise.

Questions for future work would be to explore how the results in this paper 
would be influenced by allowing random inhibitory catalysations, or side reactions 
that could destroy some of the crucial reactants (this problem has been referred to 
by Szathmary (2000) as the "plague of side reactions"). This second phenomenon
can be formally regarded as a special case of the first, since if $x$ is a reactant for
a reaction $r$ and $x$ is degraded in the presence of another molecule $y$ then we can
(formally) regard $y$ as inhibiting the reaction $r$. The model studied in this paper
could also be refined to better suit the graph-theoretic properties of real metabolic 
networks which have recently been investigated (Jeong et al., 2000; Wagner and 
Fell, 2001). 

8. Appendix: Proof of Proposition 3.1

The decision problem is clearly in the class NP. To show it is NP-complete we 
provide a reduction from 3-SAT. Consider an expression P in conjunctive normal 
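form involving binary variables $x_1, \ldots, x_n$ and where each clause in $P$ involves at
most three variables. Thus we can write

$$P = C_1 \wedge C_2 \wedge \cdots \wedge C_k$$

where

$$C_i = \bigvee_{j \in T(i)} x_j \;\vee\; \bigvee_{j \in F(i)} \neg x_j$$

and $T(i), F(i) \subseteq \{1, \ldots, n\}$, $|T(i)| + |F(i)| = 3$.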
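
To make the notation concrete, the sketch below encodes each clause $C_i$ by its pair of index sets, reading $T(i)$ as the indices of variables that appear unnegated in the clause and $F(i)$ as those that appear negated (the reading used in the rest of the proof). The function names and the example formula are illustrative assumptions, and the exhaustive search is exponential in $n$; it is meant only to fix ideas.

```python
from itertools import product


def clause_satisfied(T_i, F_i, z):
    """True if some variable indexed by T_i is true, or some variable
    indexed by F_i is false, under the assignment z (dict index -> bool)."""
    return any(z[j] for j in T_i) or any(not z[j] for j in F_i)


def satisfying_assignments(n, clauses):
    """Yield every assignment of x_1..x_n that satisfies all clauses."""
    for values in product([False, True], repeat=n):
        z = dict(zip(range(1, n + 1), values))
        if all(clause_satisfied(T_i, F_i, z) for T_i, F_i in clauses):
            yield z


# Hypothetical example: P = (x1 or x2 or not x3) and (not x1 or x2 or x3),
# stored as a list of (T(i), F(i)) pairs.
example = [({1, 2}, {3}), ({2, 3}, {1})]
```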

Given $P$ construct a catalytic reaction system $Q = (X, \mathcal{R}, C(+), C(-))$ as follows:
let $F := \{x_1, \ldots, x_n\}$, let

$$X := \{x_1, \ldots, x_n, f_1, \ldots, f_n, t_1, \ldots, t_n, \theta_1, \ldots, \theta_k, !\}.$$

Informally, $x_i$ will correspond to the variable $x_i$ in the formula; a reaction producing
$t_i$ (respectively $f_i$) will be catalyzed if the truth assignment of $x_i$ is true (respectively
false), and the reaction producing $\theta_i$ will be catalyzed if the $i$'th clause is satisfied.

More formally we let $\mathcal{R} = \mathcal{R}_1 \cup \mathcal{R}_2 \cup \mathcal{R}_3$ where $\mathcal{R}_1$ consists of all reactions

for $1 \le i \le n$. In words, $x_i$ is the sole reactant for $f_i$ and $t_i$, but $f_i$ inhibits the
catalysation of $t_i$ and vice versa.

$\mathcal{R}_2$ consists of all reactions

i(+) 



$t_j \to \theta_i$, if $j \in T(i)$,
$f_j \to \theta_i$, if $j \in F(i)$,
for $1 \le j \le n$ and $1 \le i \le k$. Finally $\mathcal{R}_3$ consists of the single reaction

$\{\theta_1, \ldots, \theta_k\} \to {!}$.

Now we claim that P has a satisfying truth assignment if and only if Q has a 
RAF. To establish this, first assume that P has a satisfying assignment. Fix such 
an assignment z and let {T,F} be a partition of {1, . . . ,n} corresponding to the 
variables that are true (respectively false) in z. 

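Now consider $\mathcal{R}'_1 \cup \mathcal{R}'_2 \cup \mathcal{R}_3$ where $\mathcal{R}'_1 \subseteq \mathcal{R}_1$ consists of the reactions $x_i \to t_i$ for
all $i \in T$, and the reactions $x_i \to f_i$ for all $i \in F$. $\mathcal{R}'_2$ will consist of the reactions
$t_i \to \theta_j$ for all $i \in T \cap T(j)$ and $f_i \to \theta_j$ for all $i \in F \cap F(j)$. Since the assignment
$z$ satisfies the formula it follows that $\mathcal{R}'_1 \cup \mathcal{R}'_2 \cup \mathcal{R}_3$ is a RAF.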
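
The construction of the two reaction subsets from an assignment can be sketched as follows. This is an illustration only: reactions are represented as hypothetical `(reactant, product)` string pairs, clauses as `(T(j), F(j))` index-set pairs, and catalysis and inhibition are not modelled. The sketch checks just the combinatorial fact the argument relies on, namely that every $\theta_j$ has a producing reaction when $z$ satisfies the formula.

```python
def sub_system(n, clauses, z):
    """Build the chosen subsets of R_1 and R_2 for an assignment z.

    z maps each index 1..n to True or False; clauses is a list of
    (T(j), F(j)) pairs of index sets.
    """
    T = {i for i in range(1, n + 1) if z[i]}
    F = {i for i in range(1, n + 1) if not z[i]}
    # x_i -> t_i for true variables, x_i -> f_i for false ones.
    R1p = [(f"x{i}", f"t{i}") for i in sorted(T)]
    R1p += [(f"x{i}", f"f{i}") for i in sorted(F)]
    # t_i -> theta_j for i in T and T(j); f_i -> theta_j for i in F and F(j).
    R2p = []
    for j, (T_j, F_j) in enumerate(clauses, start=1):
        R2p += [(f"t{i}", f"theta{j}") for i in sorted(T & T_j)]
        R2p += [(f"f{i}", f"theta{j}") for i in sorted(F & F_j)]
    return R1p, R2p


def all_thetas_produced(clauses, R2p):
    """True if every theta_j is the product of at least one chosen reaction."""
    produced = {prod for _, prod in R2p}
    return all(f"theta{j}" in produced for j in range(1, len(clauses) + 1))
```

For an assignment that falsifies some clause $C_j$, both intersections for that clause are empty, so $\theta_j$ has no producing reaction and the single reaction in $\mathcal{R}_3$ lacks a reactant.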

Next we have to show that if the system has a RAF the formula has a satisfying
truth assignment. Suppose the system has a RAF $\mathcal{R}'$. Clearly $\mathcal{R}_3 \subseteq \mathcal{R}'$. This in
turn implies that the reactions producing $\theta_1, \ldots, \theta_k$ are all catalyzed. Thus for all
$1 \le i \le k$, there either exists some $j \in T(i)$ such that the reaction producing $t_j$
is catalyzed or there exists some $j \in F(i)$ such that the reaction producing $f_j$ is
catalyzed. Moreover, for all $i$ at most one of the reactions producing $t_i$ and $f_i$ can be
catalyzed. We now define $z_i$ to be true if the reaction producing $t_i$ is catalyzed and
false if the reaction producing $f_i$ is catalyzed ($z_i$ is defined arbitrarily otherwise).
Then $z$ is a satisfying assignment as required.
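
The last step, reading a truth assignment off a RAF, can be sketched as below. The representation is an assumption made for illustration: `produced` is a set of hypothetical molecule names (`"t1"`, `"f2"`, ...) whose producing reactions are catalyzed in the RAF, and `default` is the arbitrary value given to variables for which neither molecule appears.

```python
def assignment_from_products(n, produced, default=True):
    """Define z_i from which of t_i, f_i has a catalyzed producing reaction."""
    z = {}
    for i in range(1, n + 1):
        has_t = f"t{i}" in produced
        has_f = f"f{i}" in produced
        if has_t and has_f:
            # Excluded in the proof: t_i and f_i inhibit each other's production.
            raise ValueError(f"t{i} and f{i} cannot both be produced")
        # True if t_i is produced, False if f_i is, `default` otherwise.
        z[i] = has_t or (default and not has_f)
    return z


def satisfies(clauses, z):
    """Check z against clauses given as (T(i), F(i)) index-set pairs."""
    return all(any(z[j] for j in T_i) or any(not z[j] for j in F_i)
               for T_i, F_i in clauses)
```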